Abstract

Super-symmetric tensors - a higher-order extension of scatter matrices - are becoming
increasingly popular in machine learning and computer vision for modeling data statistics,
co-occurrences, or even as visual descriptors [?]. However, the size of these tensors is
exponential in the data dimensionality, which is a significant concern. In this paper, we study
third-order super-symmetric tensor descriptors in the context of dictionary learning and sparse
coding. Our goal is to approximate these tensors as sparse conic combinations of atoms from a
learned dictionary, where each atom is a symmetric positive semi-definite matrix. Apart from the
significant benefits to tensor compression that this framework provides, our experiments
demonstrate that the sparse coefficients produced by the scheme lead to better aggregation of
high-dimensional data and showcase superior performance on two common computer vision tasks
compared to the state of the art.


1 Introduction 

Recent times have seen an increasing trend in several machine learning and computer vision
applications to use rich data representations that capture the inherent structure or statistics
of data. A few notable examples of such representations are histograms, strings, covariances,
trees, and graphs. The goal of this paper is to study a new class of structured data descriptors -
third-order super-symmetric tensors - in the context of sparse coding and dictionary learning.

Tensors are often used to capture higher-order moments of data distributions such as the
covariance, skewness or kurtosis and have been used as data descriptors in several computer vision
applications. In region covariances [?], a covariance matrix - computed over multi-modal features
from image regions - is used as a descriptor for the regions and is seen to demonstrate superior
performance in object tracking, retrieval, and video understanding. Given bag-of-words histograms
of data vectors from an image, a second-order co-occurrence pooling of these histograms captures
the occurrences of two features together in an image and has recently been shown to provide
superior performance on problems such as semantic segmentation of images, compared to their
first-order counterparts [?]. A natural extension of the idea is to use higher-order pooling
operators, an extensive experimental analysis of which is provided in [?]. That paper shows that
pooling using third-order super-symmetric tensors can significantly improve upon the second-order
descriptors, e.g., by more than 5% MAP on the challenging PASCAL VOC07 dataset.

However, given that the size of the tensors increases exponentially against the dimensionality of
their first-order counterpart, efficient representations are extremely important for applications
that use these higher-order descriptors. To this end, the goal of this paper is to study these
descriptors in the classical dictionary learning and sparse coding setting [23]. Using the
properties of super-symmetric tensors, we formulate a novel optimization objective (Section 4) in
which each third-order data tensor is approximated by a sparse non-negative linear combination of
positive semi-definite matrices. Although our objective is non-convex - typical of several
dictionary learning algorithms - we show that it is convex in each variable, thus allowing a
block-coordinate descent scheme for the optimization. Experiments (presented in Section 6) on the
PASCAL VOC07 dataset show that the compressed coefficient vectors produced by our sparse coding
scheme preserve the discriminative properties of the original tensors and lead to performance
competitive with the state of the art, while providing superior tensor compression and
aggregation. Inspired by the merits of third-order pooling techniques proposed in [?], we further
introduce a novel tensor descriptor for texture recognition via the linearization of explicitly
defined RBF kernels, and show that sparse coding of these novel tensors using our proposed
framework provides better performance than the state-of-the-art descriptors used for texture
recognition.


1.1 Related Work 

Dictionary learning and sparse coding [8, 23] methods have significantly contributed to improving
the performance of numerous applications in computer vision and image processing. While these
algorithms assume a Euclidean vectorial representation of the data, there have been extensions to
other data descriptors such as tensors, especially symmetric positive definite matrices
[6, 9, 29]. These extensions typically use non-Euclidean geometries defined by a similarity metric
that they use for comparing the second-order tensors. Popular choices for comparisons of positive
definite matrices are the log-determinant divergence [29], the log-Euclidean metric [?], and the
affine invariant Riemannian metric [24]. Alas, the third-order tensors considered in this paper
are not positive definite, nor are there any standard similarity metrics known for them apart from
the Euclidean distance. Thus, extending these prior methods to our setting is infeasible,
demanding novel formulations.

Third-order tensors have been used for various tasks. A spatio-temporal third-order tensor on
video data for activity analysis and gesture recognition is proposed in [?]. Non-negative
factorization is applied to third-order tensors in [28] for image denoising. Multi-linear
algebraic methods for tensor subspace learning are surveyed in [?]. Tensor textures are used in
the context of texture rendering in computer vision applications in [33]. Similar to eigenfaces
for face recognition, multi-linear algebra based techniques for face recognition use third-order
tensors in [?]. However, while these applications generally work with a single tensor, the
objective of this paper is to learn the underlying structure of a large collection of such tensors
generated from visual data using the framework of dictionary learning and sparse coding, which to
the best of our knowledge is a novel proposition.

In addition to the dictionary learning framework, we also introduce an image descriptor for
textures. While we are not aware of any previous works that propose third-order tensors as image
region descriptors, the most similar methods to our approach are i) third-order probability
matrices for image splicing detection [?] and ii) third-order global image descriptors which
aggregate SIFT into the autocorrelation tensor [?]. In contrast, our descriptor is assembled from
elementary signals such as luminance and its first- and second-order derivatives, analogous to
covariance descriptors [?], but demonstrating superior accuracy. The formulation chosen by us for
texture recognition also differs from prior works such as kernel descriptors [?] and convolutional
kernel networks [?].


2 Notations 


Before we proceed, we briefly review our notation. We use bold-face upper-case letters for
third-order tensors, bold-face lower-case letters for vectors, and normal fonts for scalars. Each
second-order tensor along the third mode of a third-order tensor is called a slice. Using Matlab
style notation, the $s$-th slice of $\mathcal{X}$ is given by $\mathcal{X}_{:,:,s}$. The operation
$\uparrow\!\otimes$ stands for an outer product of a second-order tensor with a vector. For
example, $\mathcal{Y} = \mathbf{Y} \uparrow\!\otimes \mathbf{y}$ produces a third-order tensor
$\mathcal{Y} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ from a matrix
$\mathbf{Y} \in \mathbb{R}^{d_1 \times d_2}$ and a vector $\mathbf{y} \in \mathbb{R}^{d_3}$, where
the $s$-th slice of $\mathcal{Y}$ is given by $\mathbf{Y} y_s$, $y_s$ being the $s$-th dimension
of $\mathbf{y}$. Let $\mathcal{I}_N$ stand for an index set of the first $N$ integers. We use
$\mathcal{S}_+^d$ to denote the space of $d \times d$ positive semi-definite matrices and
$\mathcal{S}_\times^d$ to denote the space of super-symmetric tensors of dimensions
$d \times d \times d$.

Going by the standard terminology in the higher-order tensor factorization literature [?], we
define a core tensor as the analogue of the singular value matrix in the second-order case.
However, unless some specific decompositions are enforced, a core tensor need not be diagonal as
in the second-order case. To this end, we assume the CP decomposition [?] in this paper, which
generates a diagonal core tensor.

3 Background 

In this section, we review super-symmetric tensors and their properties, followed by a brief
exposition of the method described in [?] for generating tensor descriptors for an image. The
latter will prove useful when introducing our new texture descriptors.

3.1 Third-order Super-symmetric Tensors 

We define a super-symmetric tensor descriptor as follows: 

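Definition 1. Suppose $\mathbf{x}_n \in \mathbb{R}^d$, $n \in \mathcal{I}_N$, represent data
vectors from an image, then a third-order super-symmetric tensor (TOSST) descriptor
$\mathcal{X} \in \mathcal{S}_\times^d$ of these data vectors is given by:

\[ \mathcal{X} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbf{x}_n \mathbf{x}_n^T \right) \uparrow\!\otimes \mathbf{x}_n. \tag{1} \]

In the object recognition setting, the data vectors are typically SIFT descriptors extracted from
the image, while the tensor descriptor is obtained as the third-order autocorrelation tensor by
applying (1).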
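
The averaging of rank-one outer products in Definition 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `tosst_descriptor` and the random toy data are assumptions made here, not part of the original method description.

```python
import numpy as np

def tosst_descriptor(X):
    """Average of (x x^T) outer x over the rows of X (shape N x d)."""
    # einsum accumulates x_i * x_j * x_k over the N samples; dividing by N averages.
    return np.einsum("ni,nj,nk->ijk", X, X, X) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.random((50, 4))        # 50 toy data vectors of dimension d = 4 (hypothetical data)
T = tosst_descriptor(X)        # tensor of shape (4, 4, 4)

# The s-th slice equals the average of x x^T weighted by the s-th coordinate of x.
slice0 = (X[:, :, None] * X[:, None, :] * X[:, 0][:, None, None]).mean(axis=0)
```

Each slice `T[:, :, s]` is therefore a weighted scatter matrix, which is why the descriptor can be viewed as a stack of $d$ second-order matrices.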

The following properties of the TOSST descriptor are useful in the sequel. 

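Proposition 1. For a TOSST descriptor $\mathcal{X} \in \mathcal{S}_\times^d$, we have:

1. Super-symmetry: $\mathcal{X}_{i,j,k} = \mathcal{X}_{\Pi(i,j,k)}$ for indexes $i, j, k$ and their permutation given by $\Pi$.

2. Every slice is positive semi-definite, that is, $\mathcal{X}_{:,:,s} \in \mathcal{S}_+^d$, for $s \in \mathcal{I}_d$.

3. Indefiniteness, i.e., it can have positive, negative, or zero eigenvalues in its core tensor
(assuming a CP decomposition [?]).

Proof. The first two properties can be proved directly from Definition 1. To prove the last
property, note that for a TOSST tensor $\mathcal{X}$ and some $\mathbf{z} \in \mathbb{R}^d$,
$\mathbf{z} \neq 0$, we have
$((\mathcal{X} \otimes_1 \mathbf{z}) \otimes_2 \mathbf{z}) \otimes_3 \mathbf{z} = \sum_{s=1}^{d} z_s \left( \mathbf{z}^T \mathbf{X}_s \mathbf{z} \right)$,
where $\mathbf{X}_s$ is the $s$-th slice of $\mathcal{X}$. While
$\mathbf{z}^T \mathbf{X}_s \mathbf{z} \geq 0$, the tensor product could be negative for
$\mathbf{z} < 0$. In the above, $\otimes_i$ denotes the tensor product in the $i$-th mode [?]. □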

Due to the indefiniteness of the tensors, we cannot use some of the well-known distances on the
manifold of SPD matrices [?]. Thus, we restrict ourselves to using the Euclidean distance for the
derivations and experiments in this paper.
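
The three properties of Proposition 1 can be checked numerically. The sketch below assumes non-negative toy data (an assumption made here so that every slice is a non-negatively weighted sum of rank-one PSD matrices); the helper name `cubic_form` is hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d = 5
X = rng.random((200, d))                            # non-negative toy vectors (assumption)
T = np.einsum("ni,nj,nk->ijk", X, X, X) / len(X)    # TOSST descriptor of the toy data

# 1. Super-symmetry: invariant under every permutation of the three modes.
super_symmetric = all(
    np.allclose(T, T.transpose(p)) for p in itertools.permutations(range(3))
)

# 2. Every slice is PSD: the smallest eigenvalue over all slices is non-negative.
min_slice_eig = min(np.linalg.eigvalsh(T[:, :, s]).min() for s in range(d))

# 3. Indefiniteness: the cubic form sum_s z_s (z^T X_s z) changes sign with z.
def cubic_form(T, z):
    return np.einsum("ijk,i,j,k->", T, z, z, z)

z = np.ones(d)
positive_value = cubic_form(T, z)
negative_value = cubic_form(T, -z)
```

Flipping the sign of `z` flips the sign of the cubic form, which illustrates why the tensor as a whole cannot be treated as positive definite even though each slice is.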

Among several properties of tensors, one that is important and is typically preserved by tensor 
decompositions is the tensor rank defined as: 

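Definition 2 (Tensor Rank). Given a tensor $\mathcal{X} \in \mathcal{S}_\times^d$, its tensor rank
$\mathrm{TRank}(\mathcal{X})$ is defined as the minimum $p$ such that
$\mathcal{X}_{:,:,s} \in \mathrm{Span}(\mathbf{M}_1, \mathbf{M}_2, ..., \mathbf{M}_p)$ for all
$s \in \mathcal{I}_d$, where $\mathbf{M}_i \in \mathcal{S}_+^d$ are rank-one.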

4 Problem Formulation 

Suppose we are given data tensors $\mathcal{X}_n$, $n \in \mathcal{I}_N$, each
$\mathcal{X}_n \in \mathcal{S}_\times^d$. We want to learn a tensor dictionary $\mathcal{B}$ with
atoms $\mathcal{B}_1, \mathcal{B}_2, ..., \mathcal{B}_K$, where each
$\mathcal{B}_k \in \mathcal{S}_\times^d$ consists of $d$ slices. Let the $s$-th slice of the $k$-th
atom be given by $\mathbf{B}_k^s$, where $s \in \mathcal{I}_d$, $k \in \mathcal{I}_K$. Each slice
$\mathbf{B}_k^s \in \mathcal{S}_+^d$. Then, the problem of dictionary learning and sparse coding
can be formulated as follows: for sparse coefficient vectors
$\boldsymbol{\alpha}^n \in \mathbb{R}^K$, $n \in \mathcal{I}_N$,

\[ \arg\min_{\mathcal{B},\, \boldsymbol{\alpha}^1, ..., \boldsymbol{\alpha}^N} \; \sum_{n=1}^{N} \left\| \mathcal{X}_n - \sum_{k=1}^{K} \mathcal{B}_k \alpha_k^n \right\|_F^2 + \lambda \left\| \boldsymbol{\alpha}^n \right\|_1. \tag{2} \]

Note that in the above formulation, third-order dictionary atoms $\mathcal{B}_k$ are multiplied by
single scalars $\alpha_k^n$. For $K$ atoms, each of size $d^3$, there are $Kd^3$ parameters to be
learned in the dictionary learning process. An obvious consequence of this reconstruction process
is that we require $N \gg K$ to prevent overfitting. To circumvent this bottleneck and work in the
regime $NS \gg K$, the reconstruction process is amended such that every dictionary atom is
represented by an outer product of a second-order symmetric positive semi-definite matrix
$\mathbf{B}_k$ and a vector $\mathbf{b}_k$. Such a decomposition/approximation reduces the number
of parameters from $Kd^3$ to $K(d^2 + d)$. Using this strategy, we re-define the dictionary
$\mathcal{B}$ to be represented by atom pairs
$\mathcal{B} = \{(\mathbf{B}_k, \mathbf{b}_k)\}_{k=1}^{K}$ such that
$\mathcal{B}_k = \mathbf{B}_k \uparrow\!\otimes \mathbf{b}_k$. Note that the tensor rank of
$\mathcal{B}_k$ under this new representation is equal to $\mathrm{Rank}(\mathbf{B}_k)$, as
formally stated below.

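Proposition 2 (Atom Rank). Let $\mathcal{B} = \mathbf{B} \uparrow\!\otimes \mathbf{b}$ and
$\|\mathbf{b}\| \neq 0$, then $\mathrm{TRank}(\mathcal{B}) = \mathrm{Rank}(\mathbf{B})$.

Proof. The rank of $\mathcal{B}$ is the smallest $p$ satisfying
$(\mathbf{B} \uparrow\!\otimes \mathbf{b})_{:,:,s} = \mathbf{B} \cdot b_s \in \mathrm{Span}(\mathbf{M}_1, \mathbf{M}_2, ..., \mathbf{M}_p)$,
$\forall s = 1, ..., d$, where $\mathbf{M}_i \in \mathcal{S}_+^d$ are rank-one. This is equivalent
to the smallest $p$ satisfying
$\mathbf{B} \in \mathrm{Span}(\mathbf{M}_1, \mathbf{M}_2, ..., \mathbf{M}_p)$, as multiplication
of a matrix by any non-zero scalar ($\mathbf{B} \cdot b_s$, $b_s \neq 0$) does not change the
eigenvectors of this matrix, which compose the spanning set. □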
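
To make the factored atom concrete, the following sketch builds one atom as the outer product of a PSD matrix with a vector, checks that every slice keeps the rank of the matrix, and compares the two parameter counts. The sizes `d`, `R`, `K` and the helper name `atom_tensor` are illustrative assumptions only.

```python
import numpy as np

def atom_tensor(B, b):
    """Third-order atom whose s-th slice is the matrix B scaled by b[s]."""
    return np.einsum("ij,s->ijs", B, b)

rng = np.random.default_rng(2)
d, R, K = 6, 2, 10                  # toy sizes chosen for illustration
U = rng.standard_normal((d, R))
B = U @ U.T                         # symmetric PSD matrix of rank R
B /= np.trace(B)                    # unit trace, as in the normalization discussed in the text
b = rng.random(d) + 0.1             # strictly positive vector atom, so no slice vanishes

A = atom_tensor(B, b)
slice_ranks = [np.linalg.matrix_rank(A[:, :, s]) for s in range(d)]

params_full = K * d**3              # K unconstrained third-order atoms
params_factored = K * (d**2 + d)    # K (matrix, vector) atom pairs
```

For these toy sizes the factored dictionary needs 420 parameters instead of 2160, and the gap widens quickly as `d` grows.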


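Using this idea, we can rewrite (2) as follows:

\[ \arg\min_{\mathcal{B},\, \boldsymbol{\alpha}^1, ..., \boldsymbol{\alpha}^N} \; \sum_{n=1}^{N} \left\| \mathcal{X}_n - \sum_{k=1}^{K} \left( \mathbf{B}_k \uparrow\!\otimes \mathbf{b}_k \right) \alpha_k^n \right\|_F^2 + \lambda \left\| \boldsymbol{\alpha}^n \right\|_1. \tag{3} \]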


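Introducing optimization variables $\boldsymbol{\beta}_k^n$ and rewriting the loss function by
taking the slices out, we can further rewrite (3) as:

\[ \arg\min_{\mathcal{B},\, \boldsymbol{\alpha}^1, ..., \boldsymbol{\alpha}^N} \; \sum_{n=1}^{N} \sum_{s=1}^{S} \left\| \mathbf{X}_n^s - \sum_{k=1}^{K} \mathbf{B}_k \beta_k^{s,n} \right\|_F^2 + \lambda \left\| \boldsymbol{\alpha}^n \right\|_1 \]
\[ \text{subject to } \boldsymbol{\beta}_k^n = \mathbf{b}_k \alpha_k^n, \; \forall k \in \mathcal{I}_K \text{ and } n \in \mathcal{I}_N. \tag{4} \]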


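We may rewrite the equality constraints in (4) as a proximity constraint in the objective function
(using a regularization parameter $\gamma$), and constrain the sparse coefficients in
$\boldsymbol{\alpha}^n$ to be non-negative, as each slice is positive semi-definite. This is a
standard technique used in optimization, commonly referred to as variable splitting. We can also
normalize the dictionary atoms $\mathbf{B}_k$ to be positive semi-definite and of unit trace for
better numerical stability. In that case, vector atoms $\mathbf{b}_k$ can be constrained to be
non-negative (or even better, $\boldsymbol{\beta}^{s,n} \geq 0$ instead of
$\boldsymbol{\alpha}^n \geq 0$ and $\mathbf{b}_k \geq 0$). Introducing the chosen constraints, we
can rewrite our tensor dictionary learning and sparse coding formulation:

\[ \arg\min_{\mathcal{B},\, \boldsymbol{\alpha}^1, ..., \boldsymbol{\alpha}^N} \; \sum_{n=1}^{N} \sum_{s=1}^{S} \left( \left\| \mathbf{X}_n^s - \sum_{k=1}^{K} \mathbf{B}_k \beta_k^{s,n} \right\|_F^2 + \gamma \left\| \boldsymbol{\beta}^{s,n} - \mathbf{b}^s \odot \boldsymbol{\alpha}^n \right\|_2^2 \right) + \lambda \left\| \boldsymbol{\alpha}^n \right\|_1 \]
\[ \text{subject to } \mathbf{B}_k \succeq 0, \; \mathrm{Tr}(\mathbf{B}_k) \leq 1, \; \|\mathbf{b}_k\|_1 \leq 1, \; k = 1, ..., K. \tag{5} \]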


In the above equation, operator $\odot$ is an element-wise multiplication between vectors.
Coefficients $\gamma$ and $\lambda$ control the quality of the proximity and the sparsity of
$\boldsymbol{\alpha}^n$, respectively. We might replace the loss function with a loss based on the
Riemannian geometry of positive definite matrices [?], in which case the convexity would be lost
and thus convergence cannot be guaranteed. The formulation (5) is convex in each variable and can
be solved efficiently using block coordinate descent.
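
As a reading aid, the cost of the variable-split formulation for a single data tensor can be sketched as below. The array layout (atoms stacked along the first axis, row `k` of `b` holding the vector atom of atom `k`, column `s` of `beta` holding the auxiliary coefficients of slice `s`) is an assumption of this sketch, as is the function name `objective`; the constraints on the atoms are not enforced here.

```python
import numpy as np

def objective(Xn, B, b, beta, alpha, gamma, lam):
    """Cost for one tensor: slice-wise fit + proximity term + l1 sparsity term.

    Xn:    (d, d, S) data tensor with slices along the last axis
    B:     (K, d, d) matrix atoms
    b:     (K, S)    vector atoms (row k is the vector paired with B[k])
    beta:  (K, S)    auxiliary coefficients (column s is used for slice s)
    alpha: (K,)      sparse code
    """
    recon = np.einsum("kij,ks->ijs", B, beta)                # sum_k B_k * beta_k^s per slice
    fit = np.sum((Xn - recon) ** 2)                          # squared Frobenius error over slices
    prox = gamma * np.sum((beta - b * alpha[:, None]) ** 2)  # ties beta to b (elementwise) alpha
    return fit + prox + lam * np.abs(alpha).sum()

# Toy check: a tensor built exactly from the atoms leaves only the sparsity term.
rng = np.random.default_rng(0)
K, d = 3, 4
U = rng.standard_normal((K, d, d))
B = U @ U.transpose(0, 2, 1)                                 # PSD matrix atoms
b = rng.random((K, d))
alpha = rng.random(K)
Xn = np.einsum("kij,ks,k->ijs", B, b, alpha)
value = objective(Xn, B, b, b * alpha[:, None], alpha, gamma=0.5, lam=0.1)
```

When `beta` equals `b * alpha[:, None]`, the proximity term vanishes and the cost reduces to the fit of the factored atoms plus the sparsity penalty, which is the sense in which the split formulation relaxes the equality-constrained one.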

Remark 1 (Non-symmetric Tensors). Note that non-symmetric tensors
$\mathcal{X}_n \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ can also be coded if the positive
semi-definite constraint on $\mathbf{B}_k$ is dropped, with
$\mathbf{B}_k \in \mathbb{R}^{d_1 \times d_2}$ and $\mathbf{b}_k \in \mathbb{R}^{d_3}$. The other
non-negativity constraints can then also be removed. While this case is a straightforward
extension of our formulation, a thorough investigation into it is left as future work.

Optimization


We propose a block-coordinate descent scheme to solve (5), in which each variable in the
formulation is solved separately, while keeping the other variables fixed. As each sub-problem is
convex, we are guaranteed to converge to a local minimum. Initial values of $\mathbf{B}_k$,
$\mathbf{b}_k$, and $\boldsymbol{\alpha}^n$ are chosen randomly from the uniform distribution
within a prespecified range. Vectors $\boldsymbol{\beta}^{s,n}$ are initialized with the Lasso
algorithm. Next, we solve four separate convex sub-problems resulting in updates of
$\mathbf{B}_k$, $\mathbf{b}_k$, $\boldsymbol{\beta}^{s,n}$, and $\boldsymbol{\alpha}^n$. For
learning second-order matrices $\mathbf{B}_k$, we employ the PQN solver [?], which enables us to
implement and apply projections onto the SPD cone, and to handle the trace norm or even the rank
of matrices (if this constraint is used). Similarly, the projected gradient is used to keep
$\mathbf{b}_k$ inside the simplex $\|\mathbf{b}_k\|_1 \leq 1$. For the remaining sub-problems, we
use the L-BFGS-B solver [4], which enables us to handle simple box constraints, e.g., we split
coding for $\boldsymbol{\alpha}^n$ into two non-negative sub-problems $\boldsymbol{\alpha}_+^n$
and $\boldsymbol{\alpha}_-^n$ by applying box constraints $\boldsymbol{\alpha}_+^n \geq 0$ and
$\boldsymbol{\alpha}_-^n \geq 0$. Finally, we obtain
$\boldsymbol{\alpha}^n = \boldsymbol{\alpha}_+^n - \boldsymbol{\alpha}_-^n$ after minimization.
Moreover, for the coding phase (no dictionary learning), we fix
$\{(\mathbf{B}_k, \mathbf{b}_k)\}_{k=1}^{K}$. The algorithm proceeds as above, but without updates
of $\mathbf{B}_k$ and $\mathbf{b}_k$. Algorithm 1 shows the important steps of our algorithm; this
includes both the dictionary learning and the sparse coding sub-parts.
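
The two projections mentioned above can be sketched as follows. This is a minimal sketch that uses a symmetric eigendecomposition for the PSD projection (a swapped-in but equivalent choice for symmetric matrices, not necessarily what the PQN-based implementation does); the function names are hypothetical, and the trace and rank constraints are not handled here.

```python
import numpy as np

def project_psd(M):
    """Euclidean projection of a square matrix onto the PSD cone."""
    M = (M + M.T) / 2.0                        # symmetrize first
    w, V = np.linalg.eigh(M)                   # eigenvalues w, orthonormal eigenvectors V
    return (V * np.clip(w, 0.0, None)) @ V.T   # zero out the negative eigenvalues

def project_nonneg(v):
    """Projection onto the non-negative orthant (element-wise max with 0)."""
    return np.maximum(v, 0.0)

M = np.array([[2.0, 0.0], [0.0, -1.0]])
P = project_psd(M)                             # the negative eigenvalue is clipped to zero
clipped = project_nonneg(np.array([-0.5, 0.3]))
```

Both maps are idempotent, so applying them after every gradient step keeps the iterates feasible without accumulating drift.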


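Algorithm 1: Third-order Sparse Dictionary Learning and Coding.

Data: $N$ data tensors $\mathcal{X} = \{\mathcal{X}_1, ..., \mathcal{X}_N\}$, proximity and
sparsity constants $\gamma$ and $\lambda$, stepsize $\eta$, $K$ dictionary atoms
$\mathcal{B} = \{(\mathbf{B}_k, \mathbf{b}_k)\}_{k=1}^{K}$ if coding, otherwise LearnDict = true
Result: $N$ sparse coeffs. $\boldsymbol{\alpha} = \{\boldsymbol{\alpha}^1, ..., \boldsymbol{\alpha}^N\}$,
$K$ atoms $\mathcal{B} = \{(\mathbf{B}_k, \mathbf{b}_k)\}_{k=1}^{K}$ if LearnDict = true

Initialization:

if LearnDict then
  • Uniformly sample slices from $\mathcal{X}$ and fill $\mathbf{B}_k$, $\forall k \in \mathcal{I}_K$
  • Uniformly sample values from $-1/K$ to $1/K$ and initialize $\mathbf{b}_k$, $\forall k \in \mathcal{I}_K$
end
• Vectorize inputs $\mathbf{X}_n^s$ and atoms $\mathbf{B}_k$, $\forall s \in \mathcal{I}_S$, $\forall n \in \mathcal{I}_N$, $\forall k \in \mathcal{I}_K$,
  solve the Lasso problem and fill $\boldsymbol{\beta}^{s,n}$, $\forall s \in \mathcal{I}_S$, $\forall n \in \mathcal{I}_N$
• Uniformly sample values from $-1/(K\lambda)$ to $1/(K\lambda)$ and fill $\boldsymbol{\alpha}^n$, $\forall n \in \mathcal{I}_N$
• Objective$^{(0)} = 0$, $t = 1$

Main loop:

while $\neg$Converged do
  if LearnDict then
    $\mathbf{B}_k^{(t+1)} = \Pi_{SPD}\left( \mathbf{B}_k^{(t)} - \eta \left. \frac{\partial f}{\partial \mathbf{B}_k} \right|_{\mathbf{B}^{(t)}} \right)$, $\forall k \in \mathcal{I}_K$, where $f$ is the cost from (5) and
      projection $\Pi_{SPD}(\mathbf{B}) = \mathbf{B}^*$, $\mathbf{B}^* = \mathbf{U} \max(\boldsymbol{\lambda}^*, 0) \mathbf{V}^T$ for $(\mathbf{U}, \boldsymbol{\lambda}^*, \mathbf{V}) = \mathrm{SVD}(\mathbf{B})$
    $\mathbf{b}_k^{(t+1)} = \Pi_+\left( \mathbf{b}_k^{(t)} - \eta \left. \frac{\partial f}{\partial \mathbf{b}_k} \right|_{\mathbf{b}^{(t)}} \right)$, $\forall k \in \mathcal{I}_K$, and $\Pi_+(\mathbf{b}) = \max(\mathbf{b}, 0)$ if $\Pi_+$ is used
  end
  $\boldsymbol{\beta}^{s,n,(t+1)} = \Pi_+\left( \boldsymbol{\beta}^{s,n,(t)} - \eta \left. \frac{\partial f}{\partial \boldsymbol{\beta}^{s,n}} \right|_{\boldsymbol{\beta}^{(t)}} \right)$, $\forall s \in \mathcal{I}_S$, $\forall n \in \mathcal{I}_N$, use of $\Pi_+$ is optional
  $\boldsymbol{\alpha}^{n,(t+1)} = \Pi_+\left( \boldsymbol{\alpha}^{n,(t)} - \eta \left. \frac{\partial f}{\partial \boldsymbol{\alpha}^n} \right|_{\boldsymbol{\alpha}^{(t)}} \right)$, $\forall n \in \mathcal{I}_N$, use of $\Pi_+$ is optional
    where $\mathbf{B}^{(t)}$ and $\mathbf{b}^{(t)}$ are the $K$ matrix and vector atoms
  • Objective$^{(t)} = f\left( \mathcal{X}, \mathbf{B}^{(t)}, \mathbf{b}^{(t)}, \boldsymbol{\alpha}^{(t)}, \boldsymbol{\beta}^{(t)} \right)$
  • Converged = EvaluateStoppingCriteria($t$, Objective$^{(t)}$, Objective$^{(t-1)}$)
end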
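
The projected-gradient update of the auxiliary coefficients in the main loop can be sketched for a single data tensor as follows. This is a hedged illustration rather than the authors' implementation: the array layout, the function names, and in particular the stepsize (the inverse Lipschitz constant of the smooth part, chosen here so that the cost cannot increase) are assumptions of this sketch, and the atoms and the sparse code are kept fixed.

```python
import numpy as np

def smooth_cost(X, B, b, beta, alpha, gamma):
    """Fit and proximity terms of the objective for one tensor (no l1 term)."""
    recon = np.einsum("kij,ks->ijs", B, beta)
    return np.sum((X - recon) ** 2) + gamma * np.sum((beta - b * alpha[:, None]) ** 2)

def beta_step(X, B, b, beta, alpha, gamma):
    """One projected-gradient update of beta (K x S) followed by max(., 0)."""
    K = B.shape[0]
    Bm = B.reshape(K, -1)
    M = Bm @ Bm.T                                    # Gram matrix of the atoms, <B_k, B_l>
    G = np.einsum("kij,ijs->ks", B, X)               # correlations <B_k, X^s>
    grad = 2.0 * (M @ beta - G) + 2.0 * gamma * (beta - b * alpha[:, None])
    step = 1.0 / (2.0 * (np.linalg.eigvalsh(M).max() + gamma))   # 1 / Lipschitz constant
    return np.maximum(beta - step * grad, 0.0)

rng = np.random.default_rng(3)
d, K, gamma = 4, 3, 0.1
S = d
data = rng.random((30, d))
X = np.einsum("ni,nj,nk->ijk", data, data, data) / len(data)     # toy TOSST tensor
U = rng.standard_normal((K, d, 2))
B = np.einsum("kir,kjr->kij", U, U)                              # PSD atoms of rank 2
B /= np.trace(B, axis1=1, axis2=2)[:, None, None]                # unit trace
b = rng.random((K, S))
alpha = rng.random(K)

beta = np.zeros((K, S))
costs = [smooth_cost(X, B, b, beta, alpha, gamma)]
for _ in range(25):
    beta = beta_step(X, B, b, beta, alpha, gamma)
    costs.append(smooth_cost(X, B, b, beta, alpha, gamma))
```

The same pattern (gradient step, then projection) applies to the other blocks, with the projection swapped for the one appropriate to each variable.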


5 Theory 

In this section, we analyze some of the theoretical properties of our sparse coding formulation.
We establish that, under an approximation setting, a sparse coding of the data tensor exists in
some dictionary for every data point. We also provide bounds on the quality of this approximation
and its tensor rank. The following lemma is useful in the sequel.

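Lemma 1 ([30]). Let $\mathbf{Y} = \sum_{i=1}^{m} \phi_i \phi_i^T$, where each
$\phi_i \in \mathbb{R}^d$ and $m = \Omega(d^2)$. Then, there exist an
$\boldsymbol{\alpha} \in \mathbb{R}_+^m$ and $\epsilon \in (0, 1)$ such that:

\[ \mathbf{Y} \preceq \sum_{i=1}^{m} \alpha_i \phi_i \phi_i^T \preceq (1 + \epsilon) \mathbf{Y}, \tag{6} \]

where $\boldsymbol{\alpha}$ has $O(d \log(d) / \epsilon^2)$ non-zeros.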

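Theorem 1. Let $\mathcal{X} = \sum_{i=1}^{m} (\phi_i \phi_i^T) \uparrow\!\otimes \phi_i$, then
there exist a second-order dictionary $\mathbf{B}$ with $md$ atoms and a sparse coefficient vector
$\boldsymbol{\beta} \in \mathbb{R}_+^{md}$ with $O(d^2 \log(d) / \epsilon^2)$ non-zeros, such that
for $\epsilon \in (0, 1)$,

\[ \text{I) } \hat{\mathcal{X}} = \sum_{i=1}^{m} (\phi_i \phi_i^T) \uparrow\!\otimes \boldsymbol{\beta}^i \quad \text{and} \quad \text{II) } \left\| \mathcal{X} - \hat{\mathcal{X}} \right\|_F \leq \epsilon \sum_{s=1}^{d} \left\| \mathbf{X}_s \right\|_F, \]

where $\mathbf{X}_s$ is the $s$-th slice of $\mathcal{X}$ and
$\boldsymbol{\beta}^i \in \mathbb{R}_+^d$.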


Proof. To prove I: each slice $\mathbf{X}_s = \sum_{i=1}^{m} (\phi_i \phi_i^T) \phi_i^s$, where
$\phi_i^s$ is the $s$-th dimension of $\phi_i$. Let $\phi_i' = \phi_i \sqrt{\phi_i^s}$, then
$\mathbf{X}_s = \sum_{i=1}^{m} \phi_i' \phi_i'^T$. If the slices are to be treated independently
(as in the slice-wise formulation of Section 4), then to each slice we can apply Lemma 1, which
results in sparse coefficient vectors $\boldsymbol{\beta}^s$ having
$O(d \log(d) / \epsilon^2)$ non-zeros. Extending this result to all the $d$ slices, we obtain the
result.

To prove II: substituting $\hat{\mathcal{X}}$ with its sparse approximation and using the upper
bound in (6) for each slice, the result follows. □


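Theorem 2 (Approximation quality via Tensor Rank). Let $\hat{\mathcal{X}}$ be the sparse
approximation to $\mathcal{X}$ obtained by solving (5) using a dictionary $\mathcal{B}$, and let
$\mathrm{Rank}(\mathcal{B})$ represent the maximum of the rank of any atom, then

\[ \mathrm{TRank}(\hat{\mathcal{X}}) \leq \left| \bigcup_{s=1}^{d} \mathrm{Supp}(\hat{\mathbf{X}}_s) \right| \mathrm{Rank}(\mathcal{B}), \tag{7} \]

where $\mathrm{Supp}(\hat{\mathbf{X}}_s)$ is the support set produced by (5) for slice
$\mathbf{X}_s$.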


Proof. Rewriting $\hat{\mathcal{X}}$ in terms of its sparse tensor approximation, followed by
applying Theorem [?] and the definition in ([?]), the proof directly follows. □


Theorem 2 gives a bound on the rank of the approximation of a tensor $\mathcal{X}$ by
$\hat{\mathcal{X}}$ (which is a linear combination of atoms). Knowing the ranks of $\mathcal{X}$
and $\hat{\mathcal{X}}$ helps check how good the approximation is, beyond measuring the simple
Euclidean norm between them. This is useful in experiments where we impose rank-$r$
($r = 1, 2, ...$) constraints on atoms $\mathbf{B}_k$ and measure how rank constraints on
$\mathbf{B}_k$ impact the quality of approximation.


6 Experiments 

In this section, we present experiments demonstrating the usefulness of our framework. As
second-order region covariance descriptors (RCD) are the most similar tensor descriptors to our
TOSST descriptor, we decided to evaluate our performance on applications of RCDs, specifically for
texture recognition. In the sequel, we propose a novel TOSST texture descriptor. This is followed
by experiments evaluating this new descriptor against the state of the art on two texture
benchmarks, namely the Brodatz textures [?] and the UIUC materials [?]. Moreover, experiments
illustrating the behaviour of our dictionary learning are provided. We also demonstrate the
adequacy of our framework for the compression of third-order global image descriptors [?] on the
challenging PASCAL VOC07 set.

6.1 TOSST Texture Descriptors

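Many region descriptors for texture recognition are described in the literature [5, ?, ?]. Their
construction generally consists of the following steps:

i) For an image $I$ and a region $\mathcal{R}$ in it, extract feature statistics from the region:
if $(x, y)$ represent pixel coordinates in $\mathcal{R}$, then a feature vector from $\mathcal{R}$
will be:

\[ \phi_{xy}(I) = \left[ \frac{x - x_0}{w - 1}, \; \frac{y - y_0}{h - 1}, \; I_{xy}, \; \frac{\partial I_{xy}}{\partial x}, \; \frac{\partial I_{xy}}{\partial y}, \; \frac{\partial^2 I_{xy}}{\partial x^2}, \; \frac{\partial^2 I_{xy}}{\partial y^2} \right]^T, \tag{8} \]

where $(x_0, y_0)$ and $(w, h)$ are the coordinate origin, width, and height of $\mathcal{R}$,
respectively.

ii) Compute a covariance matrix
$\Phi(I, \mathcal{R}) = \frac{1}{|\mathcal{R}| - 1} \sum_{(x,y) \in \mathcal{R}} (\phi_{xy}(I) - \bar{\phi})(\phi_{xy}(I) - \bar{\phi})^T$,
where $\bar{\phi}$ is the mean over $\phi_{xy}(I)$ from that region [?]. Alternatively, one can
simply compute an autocorrelation matrix
$\Phi(I, \mathcal{R}) = \frac{1}{|\mathcal{R}|} \sum_{(x,y) \in \mathcal{R}} \phi_{xy}(I) \phi_{xy}(I)^T$
as described in [?].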
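
Steps (i) and (ii) can be sketched for a single grayscale region as below. The use of finite differences (`np.gradient`) for the image derivatives, the region-local coordinate origin, and the function name `region_features` are assumptions of this sketch; the source does not specify how the derivatives are computed.

```python
import numpy as np

def region_features(I):
    """Per-pixel 7-dimensional features for a 2-D grayscale region I (h x w)."""
    h, w = I.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)     # pixel coordinates relative to the region origin
    Iy, Ix = np.gradient(I)                     # first derivatives along rows (y) and columns (x)
    Iyy = np.gradient(Iy, axis=0)               # second derivative in y
    Ixx = np.gradient(Ix, axis=1)               # second derivative in x
    feats = np.stack([x / (w - 1), y / (h - 1), I, Ix, Iy, Ixx, Iyy], axis=-1)
    return feats.reshape(-1, 7)                 # one row per pixel

rng = np.random.default_rng(0)
region = rng.random((16, 12))                   # toy region (hypothetical data)
F = region_features(region)

covariance = np.cov(F, rowvar=False)            # step (ii), normalized by |R| - 1
autocorrelation = F.T @ F / F.shape[0]          # alternative: second-order autocorrelation
```

Dropping the third column of `F` gives the six-dimensional variant without luminance that is used in the experiments.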


In this work, we make modifications to the steps listed above to work with RBF kernels, which are
known to improve the classification performance of descriptor representations over linear methods.
The concatenated statistics in step (i) can be seen as linear features forming the linear kernel
$K_{lin}$:

\[ K_{lin}\left( (x, y, I^a), (x', y', I^b) \right) = \phi_{xy}(I^a)^T \phi_{x'y'}(I^b). \tag{9} \]

We go a step further and define the following sum kernel $K_{rbf}$ composed of several RBF kernels
$G$:

\[ K_{rbf}\left( (x, y, I^a), (x', y', I^b) \right) = \sum_{i=1}^{\tau} G_{\sigma_i}^i \left( \phi_{xy}^i(I^a) - \phi_{x'y'}^i(I^b) \right), \tag{10} \]

where $\phi_{xy}^i(I)$ is the $i$-th feature in (8) and
$G_{\sigma_i}^i(u - u') = \exp\left( -(u - u')^2 / (2\sigma_i^2) \right)$ represents a so-called
relative compatibility kernel that measures the compatibility between the features in each of the
image regions.

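The next step involves linearization of the Gaussian kernels $G_{\sigma_i}^i$ for
$i = 1, ..., \tau$ to obtain feature maps that express the kernel $K_{rbf}$ by the dot-product. In
what follows, we use a fast approximation method based on probability product kernels [?].
Specifically, we employ the inner product for the $d'$-dimensional isotropic Gaussian
distribution:

\[ G_\sigma(\mathbf{u} - \mathbf{u}') = \left( \frac{2}{\pi \sigma^2} \right)^{d'/2} \int_{\boldsymbol{\zeta} \in \mathbb{R}^{d'}} G_{\sigma/\sqrt{2}}(\mathbf{u} - \boldsymbol{\zeta}) \, G_{\sigma/\sqrt{2}}(\mathbf{u}' - \boldsymbol{\zeta}) \, d\boldsymbol{\zeta}. \tag{11} \]

Equation (11) is then approximated by replacing the integral with the sum over $Z$ pivots
$\boldsymbol{\zeta}_1, ..., \boldsymbol{\zeta}_Z$:

\[ G_\sigma(\mathbf{u} - \mathbf{u}') \approx \left\langle \sqrt{w} \phi(\mathbf{u}), \sqrt{w} \phi(\mathbf{u}') \right\rangle, \quad \phi(\mathbf{u}) = \left[ G_{\sigma/\sqrt{2}}(\mathbf{u} - \boldsymbol{\zeta}_1), ..., G_{\sigma/\sqrt{2}}(\mathbf{u} - \boldsymbol{\zeta}_Z) \right]^T. \tag{12} \]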


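A weighting constant $w$ relates to the width of the rectangle in the numerical integration and is
factored out by the $\ell_2$ normalization:

\[ G_\sigma(\mathbf{u} - \mathbf{u}') = \frac{G_\sigma(\mathbf{u} - \mathbf{u}')}{\sqrt{G_\sigma(\mathbf{u} - \mathbf{u}) \, G_\sigma(\mathbf{u}' - \mathbf{u}')}} \approx \left\langle \frac{\sqrt{w} \phi(\mathbf{u})}{\left\| \sqrt{w} \phi(\mathbf{u}) \right\|_2}, \frac{\sqrt{w} \phi(\mathbf{u}')}{\left\| \sqrt{w} \phi(\mathbf{u}') \right\|_2} \right\rangle = \left\langle \frac{\phi(\mathbf{u})}{\left\| \phi(\mathbf{u}) \right\|_2}, \frac{\phi(\mathbf{u}')}{\left\| \phi(\mathbf{u}') \right\|_2} \right\rangle. \tag{13} \]

The task of linearization of each $G_{\sigma_i}^i$ in (10) becomes trivial as these kernels operate
on one-dimensional variables ($d' = 1$). Therefore, we uniformly sample the domain of each
variable and use $Z = 5$. With this tool at hand, we rewrite kernel $K_{rbf}$ as: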


\[ K_{rbf}\left( (x, y, I^a), (x', y', I^b) \right) \approx \left\langle \mathbf{v}_{xy}^a, \mathbf{v}_{x'y'}^b \right\rangle, \tag{14} \]

where vector $\mathbf{v}_{xy}^a$ is composed of $\tau$ sub-vectors
$\phi_{xy}^i / \| \phi_{xy}^i \|_2$, and each sub-vector $i = 1, ..., \tau$ is a result of
linearization by equations (11-13). We use a third-order aggregation of $\mathbf{v}$ according to
equation (1) to generate our TOSST descriptor for textures. For comparisons, we also aggregate
$\phi$ from equation (8) into a third-order linear descriptor. See our supplementary material for
the definition of the third-order global image descriptors [?] followed by the derivation of the
third-order aggregation procedure based on kernel exponentiation.
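
The pivot-based linearization of a one-dimensional Gaussian kernel can be sketched as follows. The sketch assumes the kernel form $G_\sigma(x) = \exp(-x^2 / (2\sigma^2))$ and, to make the approximation error negligible for the check, uses a dense grid of pivots that extends well beyond the data range; both the grid and the names `gauss` and `feature_map` are choices made here, whereas the paper reports using only $Z = 5$ pivots per variable.

```python
import numpy as np

def gauss(x, sigma):
    """Un-normalized Gaussian kernel value exp(-x^2 / (2 sigma^2))."""
    return np.exp(-x**2 / (2.0 * sigma**2))

def feature_map(u, pivots, sigma):
    """l2-normalized feature map: one Gaussian response of width sigma/sqrt(2) per pivot."""
    phi = gauss(u - pivots, sigma / np.sqrt(2.0))
    return phi / np.linalg.norm(phi)

sigma = 0.5
pivots = np.linspace(-3.0, 4.0, 400)      # dense pivot grid (assumption, for accuracy)
u, u_prime = 0.3, 0.7                     # two scalar feature values in [0, 1]

approx = feature_map(u, pivots, sigma) @ feature_map(u_prime, pivots, sigma)
exact = gauss(u - u_prime, sigma)
```

With fewer pivots the feature map becomes shorter (which keeps the tensor size $d$ small) at the price of a coarser approximation of the kernel.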


6.2 Datasets 

The Brodatz dataset contains 111 different textures each one represented by a single 640x640 image. 
We follow the standard protocol, i.e., each texture image is subdivided into 49 overlapping blocks of 
size 160x160. For training, 24 randomly selected blocks are used per class, while the rest is used for 
testing. We use 10 data splits. The UIUC materials dataset contains 18 subcategories of materials 
taken in the wild from four general categories e.g., bark, fabric, construction materials, and outer 
coat of animals. Each subcategory contains 12 images taken at various scales. We apply a
leave-one-out evaluation protocol [?]. The PASCAL VOC07 set consists of 5011 images for training
and 4952 for testing, with 20 classes of objects of varied nature, e.g., human, cat, chair, train,
bottle.


6.3 Experimental Setup 

TOSST descriptors. In what follows, we evaluate the following variants of the TOSST descriptor on
the Brodatz dataset: a linear descriptor with first- and second-order derivatives as in (8) and
(9), (i) without luminance and (ii) with luminance; and an RBF descriptor as in (8) and (10),
(iii) without luminance and (iv) with luminance. Moreover, (v) is a variant based on (iv) that
uses the opponent color cues for evaluations on the UIUC materials set [?]. Removing luminance for
(i) and (iii) helps emphasize the benefit of the RBF over the linear formulation. First though, we
investigate our dictionary learning.

Dictionary learning. We evaluate several variants of our dictionary learning approach on the
Brodatz dataset. We use the RBF descriptor without the luminance cue (iii) to prevent saturation
in performance. We apply patch size 40 with stride 20 to further lower computational complexity,
sacrificing performance slightly. This results in 49 TOSST descriptors per block. We do not apply
any whitening, thus preserving the SPD structure of the TOSST slices.

Figures 1(a) and 1(b) plot the accuracy and the dictionary learning objective against an
increasing size $K$ of the dictionary. We analyze three different regularization schemes on the
second-order dictionary atoms in (5), namely (1) with only SPD constraints, (2) with SPD and
low-rank constraints, and (3) without SPD or low-rank constraints. When we enforce
$\mathrm{Rank}(\mathbf{B}_k) \leq R < d$ for $k = 1, ..., K$ to obtain low-rank atoms
$\mathbf{B}_k$ by scheme (2), the convexity w.r.t. $\mathbf{B}_k$ is lost because the optimization
domain in this case is the non-convex boundary of the PD cone. However, the formulation (5) for
schemes (1) and (3) is convex w.r.t. $\mathbf{B}_k$. The plots highlight that for lower dictionary
sizes, variants (1) and (2) perform worse than (3), which exhibits the best performance due to a
very low objective. However, the effects of overfitting start emerging for larger $K$. The SPD
constraint appears to offer the best trade-off, resulting in good performance for both small and
large $K$. In Figure 1(c), we further evaluate the performance for a Rank-$R$ constraint given
$K = 20$ atoms. The plot highlights that imposing the low-rank constraint on atoms may benefit
classification. Note that for $15 \leq R \leq 30$, the classification accuracy fluctuates by up to
2%, which is beyond the error standard deviation ($\leq 0.4\%$), even though the objective shows
no significant fluctuation in this range. This suggests that imposing some structural constraints
on dictionary atoms is an important step in dictionary learning. Lastly, Figure 1(d) shows the
classification accuracy for the coding and pooling steps w.r.t. $\lambda$ controlling sparsity and
suggests that texture classification favors low sparsity.



Variant                  Dataset           Tensor size d   Accuracy
(i) linear, no lum.      Brodatz           6               93.9±0.2%
(ii) linear, lum.        Brodatz           7               99.4±0.1%
(iii) RBF, no lum.       Brodatz           30              99.4±0.2%
(iv) RBF, lum.           Brodatz           35              99.9±0.08%
(v) RBF, lum., opp.      UIUC materials    45              58.0±4.3%

Dataset           Method   Accuracy
Brodatz           ELBCM    98.72% [26]
Brodatz           L2ECM    97.9% [?]
Brodatz           RC       97.7% [?]
UIUC materials    SD       43.5% [?]
UIUC materials    CDF      52.3±4.3% [34]
UIUC materials    RSR      52.8±5.1% [?]

Table 1: Evaluations of the proposed TOSST descriptor and comparisons to the state-of-the-art.

Figure 1: Parameter sensitivity on Brodatz textures: 1(a) and 1(b) show classification accuracy
and objective values against various dictionary sizes, respectively; 1(c) and 1(d) show accuracy
with fixed $K = 20$, but varying the atom rank $R$ and the sparsity regularization $\lambda$.

6.4 Comparison to State-of-the-Art on Textures 

In this experiment, descriptor variants were whitened per block as in [?]. In each block, we
extract 225 TOSST descriptors with patch size 20 and stride 10, apply our dictionary learning
given $K = 2000$ followed by the sparse coding, and perform pooling as in [?] prior to
classification with an SVM. Table 1 demonstrates our results, which highlight that the RBF variant
outperforms the linear descriptor. This is even more notable when the luminance cue is
deliberately removed from descriptors to degrade their expressiveness. Table 1 also lists
state-of-the-art results. With 99.9±0.08% accuracy, our method is the strongest performer.
Additionally, Table 1 provides our results on the UIUC materials obtained with descriptor variant
(v) (described above) and $K = 4000$ dictionary atoms. We used TOSST descriptors with patch size
40 and stride 20 and obtained 58.0±4.3%, which outperforms other methods.


6.5 Signature Compression on PASCAL VOC07 

In this section, we evaluate our method on the PASCAL VOC07 dataset, which is larger and more
challenging than the textures. To this end, we use the setup in [?] to generate third-order global
image descriptors (also detailed in our supp. material) with the goal of compressing them using
our sparse coding framework. In detail, we use SIFT extracted from gray-scale images with radii
12, 16, 24, 32 and stride 4, 6, 8, 10, reduce SIFT size to 90D, append SPM codes of size 11D [?]
and obtain the global image signatures, which yield a baseline score of 61.3% MAP as indicated in
Table 2 (last column). Next, we learn dictionaries using our setup as in equation (5) and apply
our sparse coding to reduce signature dimensionality from 176851 (upper simplex of third-order
tensor) to sizes from 2K to 25K. The goal is to regain the baseline score. Table 2 shows that a
learned dictionary (DL) of size 25K yields 61.2% MAP at 7x compression rate, which demonstrates
how to limit redundancy in third-order tensors. For comparison, a random dictionary is about 4.3%
worse. Figure 2 plots the impact of the number of atoms ($K$) on signature compression on this
dataset.
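
The signature size and the compression rate quoted above can be reproduced with a short calculation, assuming that "upper simplex" refers to the number of unique entries of a super-symmetric third-order tensor, i.e. the number of index triples with $i \leq j \leq k$.

```python
from math import comb

d = 90 + 11                          # reduced SIFT (90D) plus SPM codes (11D)
unique_entries = comb(d + 2, 3)      # index triples i <= j <= k of a d x d x d tensor
compression = unique_entries / 25000 # signature size relative to a 25K-atom code

print(unique_entries)                # 176851
print(round(compression, 2))         # 7.07
```

The count matches the signature dimensionality of 176851 stated in the text, and a code of 25000 coefficients corresponds to roughly a sevenfold reduction.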
Figure 2: This plot illustrates the impact of $K$ on the signature compression on PASCAL VOC07.
Dictionary learning (DL) performs better than a random dictionary (Random Dict.) formed by
sampling atoms from a distribution of DL.

K                      2000     5000     10000    18000    25000    176851
DL (MAP%)              51.5%    54.0%    58.0%    59.5%    61.2%    61.3%
Random Dict. (MAP%)    47.2%    50.8%    53.3%    55.1%    57.0%

Table 2: Evaluations of the compression w.r.t. K on third-order global descriptors and PASCAL VOC07.


Complexity. It takes about 3.2s to code a third-order tensor of size d = 30 using a dictionary with 
2000 atoms. Note that our current implementation is in MATLAB and the performance is measured 
on a 4-core CPU. Our dictionary learning formulation converges in approximately 50 iterations. 


7 Conclusions 


We presented a novel formulation and an efficient algorithm for sparse coding third-order tensors 
using a learned dictionary consisting of second-order positive semi-definite atoms. Our experiments 
demonstrate that our scheme leads to significant compression of the input descriptors, while not 
sacrificing the classification performance of the respective application. Further, we proposed a novel 
tensor descriptor for texture recognition, which when sparse-coded by our approach, achieves state- 
of-the-art performance on two standard benchmark datasets for this task. 

Acknowledgements. We thank several anonymous reviewers for their very positive reviews and 
insightful questions that helped improve this paper. 


Appendices 

A Third-order Global Image Descriptor [?]


In what follows, we outline a global image descriptor from [?] used in our experiments on the
PASCAL VOC07 dataset. The third-order global image descriptor is based on SIFT [19] aggregated
into a third-order autocorrelation tensor, followed by Higher Order Singular Value Decomposition
[?, 13] used for the purpose of signal whitening. This is achieved by Power Normalization [?],
which is performed on the eigenvalues from the so-called core tensor. The following steps from [?]
are used:


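\[ \mathcal{V}_i = \otimes_r \mathbf{v}_i, \; \forall i \in \mathcal{I} \tag{15} \]
\[ \bar{\mathcal{V}} = \mathrm{Avg}_{i \in \mathcal{I}} \left( \mathcal{V}_i \right) \tag{16} \]
\[ (\mathcal{E}; \mathbf{A}) = \mathrm{HOSVD}(\bar{\mathcal{V}}) \tag{17} \]
\[ \hat{\mathcal{E}} = \mathrm{Sgn}(\mathcal{E}) \, |\mathcal{E}|^{\gamma_e} \tag{18} \]
\[ \hat{\mathcal{V}} = \hat{\mathcal{E}} \times_1 \mathbf{A} \cdots \times_r \mathbf{A} \tag{19} \]
\[ \mathcal{X} = \mathrm{Sgn}(\hat{\mathcal{V}}) \, |\hat{\mathcal{V}}|^{\gamma_c} \tag{20} \]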


Equations (15, 16) assemble a higher-order autocorrelation tensor $\bar{\mathcal{V}}$ per image
$\mathcal{I}$. First, the outer product $\otimes_r$ of order $r$ [?] is applied to the local image
descriptors $\mathbf{v}_i \in \mathbb{R}^d$ from $\mathcal{I}$. This results in $|\mathcal{I}|$
autocorrelation matrices or third-order tensors $\mathcal{V}_i$ of rank 1 for $r = 2$ and $r = 3$,
respectively. Rank-1 tensors are then averaged by Avg, resulting in tensor $\bar{\mathcal{V}}$
that represents the contents of image $\mathcal{I}$ and could be used as a training sample.
However, practical image representations have to deal with so-called burstiness, which is "the
property that a given visual element appears more times in an image than a statistically
independent model would predict" [?].

Corrective functions such as Power Normalization are known to suppress the burstiness [?].
Therefore, equations (17-19) and (20) apply two-stage pooling with eigenvalue- and
coefficient-wise Power Normalization, respectively. In Equation (17), operator HOSVD decomposes
tensor $\bar{\mathcal{V}}$ into a core tensor $\mathcal{E}$ of eigenvalues and an orthonormal
factor matrix $\mathbf{A}$, which can be interpreted as the principal components in $r$ modes.
Element-wise Power Normalization is then applied by equation (18) to eigenvalues $\mathcal{E}$ to
even out their contributions. Tensor $\hat{\mathcal{V}}$ is then assembled in equation (19) by the
$r$-mode product detailed in [?]. Lastly, coefficient-wise Power Normalization acting on
$\hat{\mathcal{V}}$ produces tensor $\mathcal{X}$ in equation (20).


In our experiments, we use $r = 3$ and: i) $\mathcal{X} = \bar{\mathcal{V}}$ when Power
Normalization is disabled, e.g., $\gamma_e = \gamma_c = 1$, or ii) apply the above HOSVD approach
if $0 < \gamma_e < 1 \wedge 0 < \gamma_c < 1$. Therefore, the final descriptor $\mathcal{X}$ in
equation (20) represents an image by a super-symmetric third-order tensor which is guaranteed to
contain SPD matrix slices if no Power Normalization is used.
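
A minimal sketch of the two-stage Power Normalization for $r = 3$ is given below. It assumes that, because the averaged tensor is super-symmetric, a single factor matrix obtained from the SVD of the mode-1 unfolding can be used for all three modes; this simplification, the function names, and the toy data are assumptions of the sketch rather than details taken from the source.

```python
import numpy as np

def power_norm(T, gamma):
    """Element-wise signed power: Sgn(T) * |T|^gamma."""
    return np.sign(T) * np.abs(T) ** gamma

def hosvd_power_normalize(V, gamma_e, gamma_c):
    """Core-tensor and coefficient-wise Power Normalization of a d x d x d tensor."""
    d = V.shape[0]
    A, _, _ = np.linalg.svd(V.reshape(d, -1), full_matrices=False)  # factor matrix (d x d)
    E = np.einsum("ijk,ia,jb,kc->abc", V, A, A, A)                  # core tensor
    E_hat = power_norm(E, gamma_e)                                  # normalize the core
    V_hat = np.einsum("abc,ia,jb,kc->ijk", E_hat, A, A, A)          # mode products with A
    return power_norm(V_hat, gamma_c)                               # coefficient-wise step

rng = np.random.default_rng(4)
data = rng.random((40, 5))                                          # toy local descriptors
V = np.einsum("ni,nj,nk->ijk", data, data, data) / len(data)        # averaged rank-1 tensors

unchanged = hosvd_power_normalize(V, 1.0, 1.0)       # both normalizations disabled
coeff_only = hosvd_power_normalize(V, 1.0, 0.5)      # coefficient-wise step only
core_only = hosvd_power_normalize(V, 0.5, 1.0)       # core normalization only
```

With both exponents set to 1 the tensor is returned unchanged, which corresponds to case (i) in the text.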


B Derivation of the Third-order Aggregation.

Simple operations such as raising $K_{rbf}$ to the power of $r$ and applying the outer product of
order $r$ to $\mathbf{v}_{xy}$ form a higher-order tensor of rank 1 denoted as
$\mathcal{V}_{xy}$:

\[ K_{rbf}^r\left( (x, y, I^a), (x', y', I^b) \right) \approx \left\langle \mathcal{V}_{xy}^a, \mathcal{V}_{x'y'}^b \right\rangle, \text{ where } \mathcal{V}_{xy}^a = \otimes_r \mathbf{v}_{xy}^a. \tag{21} \]

The higher-order rank-1 tensors $\mathcal{V}_{xy}$ are aggregated over image regions
$\mathcal{R}^a$ and $\mathcal{R}^b$, respectively:

\[ \sum_{(x,y) \in \mathcal{R}^a} K_{rbf}^r\left( (x, y, I^a), (x', y', I^b) \right) \approx \left\langle \bar{\mathcal{V}}^a, \bar{\mathcal{V}}^b \right\rangle, \quad \bar{\mathcal{V}}^a = \mathrm{Avg}_{(x,y) \in \mathcal{R}^a} \mathcal{V}_{xy}^a, \tag{22} \]
\[ \text{where } (x', y') \in \mathcal{R}^b : \; x' = x - x_0^a + x_0^b \; \wedge \; y' = y - y_0^a + y_0^b. \tag{23} \]

The aggregation step in equation (22) is analogous to both equation (16) and step (ii), which
outlines the second-order aggregation step. Equation (23) highlights that the sum kernel is
computed over locations $(x, y) \in \mathcal{R}^a$ and the locations
$(x', y') \in \mathcal{R}^b$ corresponding to them. For instance, we subtract the region origin
$x_0^a$ and replace it with the origin $x_0^b$ to obtain $x'$ from $\mathcal{R}^b$ that
corresponds to $x$ from $\mathcal{R}^a$. The higher-order autocorrelation tensors
$\bar{\mathcal{V}}$ are formed and can then be whitened by applying equations (17-20). In our
experiments, we assume the choice of $r = 3$. Thus, the third-order dictionary learning and sparse
coding is applied to form so-called mid-level features (one per region). Given an image and a set
of overlapping regions, we apply pooling [?] over the mid-level features and obtain the final
image signature that is used as a data sample to learn a classifier.
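
The identity behind kernel exponentiation, namely that the third power of a dot product equals the inner product of the corresponding third-order rank-1 tensors, can be verified numerically. The sketch below also checks the aggregated version over all pairs of vectors from two sets; the random vectors stand in for the linearized features and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def outer3(v):
    """Third-order outer product of a vector with itself."""
    return np.einsum("i,j,k->ijk", v, v, v)

# Single pair: <v, w>^3 equals the inner product of the two rank-1 tensors.
v, w = rng.standard_normal(6), rng.standard_normal(6)
lhs = (v @ w) ** 3
rhs = np.sum(outer3(v) * outer3(w))

# Two sets of vectors: the sum over all pairs factorizes into averaged tensors.
Va, Vb = rng.standard_normal((7, 6)), rng.standard_normal((9, 6))
pairwise_sum = np.sum((Va @ Vb.T) ** 3)
Ta = np.einsum("ni,nj,nk->ijk", Va, Va, Va) / len(Va)
Tb = np.einsum("ni,nj,nk->ijk", Vb, Vb, Vb) / len(Vb)
aggregated = len(Va) * len(Vb) * np.sum(Ta * Tb)
```

This is why averaging the rank-1 tensors per region is sufficient: the per-pixel kernel evaluations never have to be computed explicitly.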